{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/vector_stores/qdrant_hybrid.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Qdrant Hybrid Search\n",
    "\n",
    "Qdrant supports hybrid search by combining search results from `sparse` and `dense` vectors.\n",
    "\n",
    "`dense` vectors are the ones you have probably already been using -- embedding models from OpenAI, BGE, SentenceTransformers, etc. are typically `dense` embedding models. They represent a piece of text as a long list of numbers. These `dense` vectors can capture rich semantics across the entire piece of text.\n",
    "\n",
    "`sparse` vectors are slightly different. They use a specialized approach or model (TF-IDF, BM25, SPLADE, etc.) for generating vectors. These vectors are typically mostly zeros, making them `sparse` vectors. These `sparse` vectors are great at capturing specific keywords and similar small details.\n",
    "\n",
    "This notebook walks through setting up and customizing hybrid search with Qdrant and `naver/efficient-splade-VI-BT-large` variants from Huggingface."
   ]
  },
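  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the distinction concrete, the toy sketch below stores one short text as a sparse vector. The vocabulary and the count-based weights are made up for illustration; models such as SPLADE learn their weights, and this is not how llama-index encodes text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration only: the vocabulary and weights are made up.\n",
    "vocab = [\"llama\", \"attention\", \"tokens\", \"context\", \"qdrant\", \"hybrid\"]\n",
    "text = \"llama tokens tokens context\"\n",
    "\n",
    "# One slot per vocabulary term; most slots stay zero.\n",
    "counts = [float(text.split().count(term)) for term in vocab]\n",
    "\n",
    "# Sparse vectors are usually stored as the indices and values of the non-zero slots.\n",
    "indices = [i for i, c in enumerate(counts) if c != 0.0]\n",
    "values = [counts[i] for i in indices]\n",
    "\n",
    "print(counts)\n",
    "print(indices, values)"
   ]
  },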
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "First, we set up our environment and load our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-vector-stores-qdrant"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index qdrant-client pypdf \"transformers[torch]\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/'\n",
    "!wget --user-agent \"Mozilla\" \"https://arxiv.org/pdf/2307.09288.pdf\" -O \"data/llama2.pdf\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "documents = SimpleDirectoryReader(\"./data/\").load_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Indexing Data\n",
    "\n",
    "Now, we can index our data. \n",
    "\n",
    "Hybrid search with Qdrant must be enabled from the beginning -- we can simply set `enable_hybrid=True`.\n",
    "\n",
    "This will run sparse vector generation locally using the `\"naver/efficient-splade-VI-BT-large-doc\"` model from Huggingface, in addition to generating dense vectors with OpenAI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex, StorageContext\n",
    "from llama_index.core import Settings\n",
    "from llama_index.vector_stores.qdrant import QdrantVectorStore\n",
    "from qdrant_client import QdrantClient\n",
    "\n",
    "# creates a persistent index on disk\n",
    "client = QdrantClient(path=\"./qdrant_data\")\n",
    "\n",
    "# create our vector store with hybrid indexing enabled\n",
    "# batch_size controls how many nodes are encoded with sparse vectors at once\n",
    "vector_store = QdrantVectorStore(\n",
    "    \"llama2_paper\", client=client, enable_hybrid=True, batch_size=20\n",
    ")\n",
    "\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "Settings.chunk_size = 512\n",
    "\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents,\n",
    "    storage_context=storage_context,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hybrid Queries\n",
    "\n",
    "When querying with hybrid mode, we can set `similarity_top_k` and `sparse_top_k` separately.\n",
    "\n",
    "`sparse_top_k` represents how many nodes will be retrieved from each of the dense and sparse queries. For example, if `sparse_top_k=5` is set, 5 nodes are retrieved using sparse vectors and 5 nodes are retrieved using dense vectors.\n",
    "\n",
    "`similarity_top_k` controls the final number of returned nodes. In the above setting, we end up with up to 10 candidate nodes before fusion. A fusion algorithm is applied to rank and order the nodes from the two vector spaces ([relative score fusion](https://weaviate.io/blog/hybrid-search-fusion-algorithms#relative-score-fusion) in this case). `similarity_top_k=2` means the top two nodes after fusion are returned."
   ]
  },
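  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of how the two parameters interact, the sketch below fuses made-up candidate scores. The node ids, the raw scores, and the equal weighting of both sides are assumptions for this example; it is a simplified stand-in, not the library's implementation (the default fusion code appears in the advanced section of this notebook)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Made-up candidates (node id -> raw score), as if sparse_top_k=3.\n",
    "dense = {\"a\": 0.82, \"b\": 0.80, \"c\": 0.75}\n",
    "sparse = {\"c\": 14.0, \"a\": 9.0, \"d\": 4.0}\n",
    "\n",
    "\n",
    "def min_max(scores):\n",
    "    \"\"\"Rescale raw scores to [0, 1] so both sides are comparable.\"\"\"\n",
    "    lo, hi = min(scores.values()), max(scores.values())\n",
    "    if hi == lo:\n",
    "        return {k: 1.0 for k in scores}\n",
    "    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}\n",
    "\n",
    "\n",
    "dense_n, sparse_n = min_max(dense), min_max(sparse)\n",
    "\n",
    "# A node missing from one side contributes 0 from that side.\n",
    "fused = {\n",
    "    node_id: 0.5 * dense_n.get(node_id, 0.0) + 0.5 * sparse_n.get(node_id, 0.0)\n",
    "    for node_id in set(dense) | set(sparse)\n",
    "}\n",
    "\n",
    "# 3 + 3 candidates, 4 unique nodes, and similarity_top_k of them are returned.\n",
    "similarity_top_k = 2\n",
    "top = sorted(fused, key=fused.get, reverse=True)[:similarity_top_k]\n",
    "print(top)"
   ]
  },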
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = index.as_query_engine(\n",
    "    similarity_top_k=2, sparse_top_k=12, vector_store_query_mode=\"hybrid\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Llama2 was specifically trained differently from Llama1 by making several changes to improve performance. These changes included performing more robust data cleaning, updating the data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import display, Markdown\n",
    "\n",
    "response = query_engine.query(\n",
    "    \"How was Llama2 specifically trained differently from Llama1?\"\n",
    ")\n",
    "\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2\n"
     ]
    }
   ],
   "source": [
    "print(len(response.source_nodes))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compare to not using hybrid search at all!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Llama 2 was specifically trained differently from Llama 1 by making several changes to improve performance. These changes included performing more robust data cleaning, updating the data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models. These modifications were made to enhance the training process and optimize the performance of Llama 2 compared to Llama 1."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import display, Markdown\n",
    "\n",
    "query_engine = index.as_query_engine(\n",
    "    similarity_top_k=2,\n",
    "    # sparse_top_k=10,\n",
    "    # vector_store_query_mode=\"hybrid\"\n",
    ")\n",
    "\n",
    "response = query_engine.query(\n",
    "    \"How was Llama2 specifically trained differently from Llama1?\"\n",
    ")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Async Support\n",
    "\n",
    "And of course, async queries are also supported (note that in-memory Qdrant data is not shared between async and sync clients!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex, StorageContext\n",
    "from llama_index.core import Settings\n",
    "from llama_index.vector_stores.qdrant import QdrantVectorStore\n",
    "from qdrant_client import AsyncQdrantClient\n",
    "\n",
    "\n",
    "# creates a persistent index on disk\n",
    "aclient = AsyncQdrantClient(path=\"./qdrant_data_async\")\n",
    "\n",
    "# create our vector store with hybrid indexing enabled\n",
    "vector_store = QdrantVectorStore(\n",
    "    collection_name=\"llama2_paper\",\n",
    "    aclient=aclient,\n",
    "    enable_hybrid=True,\n",
    "    batch_size=20,\n",
    ")\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "Settings.chunk_size = 512\n",
    "\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents,\n",
    "    storage_context=storage_context,\n",
    "    use_async=True,\n",
    ")\n",
    "\n",
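    "# hybrid mode must be requested explicitly, otherwise sparse_top_k is unused\n",
    "query_engine = index.as_query_engine(\n",
    "    similarity_top_k=2, sparse_top_k=10, vector_store_query_mode=\"hybrid\"\n",
    ")\n",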
    "\n",
    "response = await query_engine.aquery(\n",
    "    \"What baseline models are measured against in the paper?\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Advanced] Customizing Hybrid Search with Qdrant\n",
    "\n",
    "In this section, we walk through various settings that can be used to fully customize the hybrid search experience.\n",
    "\n",
    "### Customizing Sparse Vector Generation\n",
    "\n",
    "By default, sparse vector generation is done using separate models for queries and documents -- `\"naver/efficient-splade-VI-BT-large-doc\"` and `\"naver/efficient-splade-VI-BT-large-query\"`.\n",
    "\n",
    "Below is the default code for generating the sparse vectors and how you can set the functionality in the constructor. You can use this and customize as needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Any, List, Tuple\n",
    "import torch\n",
    "from transformers import AutoTokenizer, AutoModelForMaskedLM\n",
    "\n",
    "doc_tokenizer = AutoTokenizer.from_pretrained(\n",
    "    \"naver/efficient-splade-VI-BT-large-doc\"\n",
    ")\n",
    "doc_model = AutoModelForMaskedLM.from_pretrained(\n",
    "    \"naver/efficient-splade-VI-BT-large-doc\"\n",
    ")\n",
    "\n",
    "query_tokenizer = AutoTokenizer.from_pretrained(\n",
    "    \"naver/efficient-splade-VI-BT-large-query\"\n",
    ")\n",
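    "query_model = AutoModelForMaskedLM.from_pretrained(\n",
    "    \"naver/efficient-splade-VI-BT-large-query\"\n",
    ")\n",
    "\n",
    "# keep the models on the same device as the tokenized inputs used below\n",
    "if torch.cuda.is_available():\n",
    "    doc_model = doc_model.to(\"cuda\")\n",
    "    query_model = query_model.to(\"cuda\")\n",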
    "\n",
    "\n",
    "def sparse_doc_vectors(\n",
    "    texts: List[str],\n",
    ") -> Tuple[List[List[int]], List[List[float]]]:\n",
    "    \"\"\"\n",
    "    Computes vectors from logits and attention mask using ReLU, log, and max operations.\n",
    "    \"\"\"\n",
    "    tokens = doc_tokenizer(\n",
    "        texts, truncation=True, padding=True, return_tensors=\"pt\"\n",
    "    )\n",
    "    if torch.cuda.is_available():\n",
    "        tokens = tokens.to(\"cuda\")\n",
    "\n",
    "    output = doc_model(**tokens)\n",
    "    logits, attention_mask = output.logits, tokens.attention_mask\n",
    "    relu_log = torch.log(1 + torch.relu(logits))\n",
    "    weighted_log = relu_log * attention_mask.unsqueeze(-1)\n",
    "    tvecs, _ = torch.max(weighted_log, dim=1)\n",
    "\n",
    "    # extract the vectors that are non-zero and their indices\n",
    "    indices = []\n",
    "    vecs = []\n",
    "    for batch in tvecs:\n",
    "        indices.append(batch.nonzero(as_tuple=True)[0].tolist())\n",
    "        vecs.append(batch[indices[-1]].tolist())\n",
    "\n",
    "    return indices, vecs\n",
    "\n",
    "\n",
    "def sparse_query_vectors(\n",
    "    texts: List[str],\n",
    ") -> Tuple[List[List[int]], List[List[float]]]:\n",
    "    \"\"\"\n",
    "    Computes vectors from logits and attention mask using ReLU, log, and max operations.\n",
    "    \"\"\"\n",
    "    # TODO: compute sparse vectors in batches if max length is exceeded\n",
    "    tokens = query_tokenizer(\n",
    "        texts, truncation=True, padding=True, return_tensors=\"pt\"\n",
    "    )\n",
    "    if torch.cuda.is_available():\n",
    "        tokens = tokens.to(\"cuda\")\n",
    "\n",
    "    output = query_model(**tokens)\n",
    "    logits, attention_mask = output.logits, tokens.attention_mask\n",
    "    relu_log = torch.log(1 + torch.relu(logits))\n",
    "    weighted_log = relu_log * attention_mask.unsqueeze(-1)\n",
    "    tvecs, _ = torch.max(weighted_log, dim=1)\n",
    "\n",
    "    # extract the vectors that are non-zero and their indices\n",
    "    indices = []\n",
    "    vecs = []\n",
    "    for batch in tvecs:\n",
    "        indices.append(batch.nonzero(as_tuple=True)[0].tolist())\n",
    "        vecs.append(batch[indices[-1]].tolist())\n",
    "\n",
    "    return indices, vecs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_store = QdrantVectorStore(\n",
    "    \"llama2_paper\",\n",
    "    client=client,\n",
    "    enable_hybrid=True,\n",
    "    sparse_doc_fn=sparse_doc_vectors,\n",
    "    sparse_query_fn=sparse_query_vectors,\n",
    ")"
   ]
  },
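  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Any callable with the same signature can stand in for the SPLADE functions. The sketch below is a hypothetical hashing term-frequency encoder; `hashed_tf_vectors` and `NUM_BUCKETS` are names invented for this example and are not part of llama-index. It only illustrates the expected return shape: one list of indices and one list of values per input text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import zlib\n",
    "from collections import Counter\n",
    "from typing import List, Tuple\n",
    "\n",
    "NUM_BUCKETS = 30_000  # hypothetical sparse dimension\n",
    "\n",
    "\n",
    "def hashed_tf_vectors(\n",
    "    texts: List[str],\n",
    ") -> Tuple[List[List[int]], List[List[float]]]:\n",
    "    \"\"\"Map each token to a bucket via CRC32 and use its count as the weight.\"\"\"\n",
    "    indices, values = [], []\n",
    "    for text in texts:\n",
    "        buckets = Counter(\n",
    "            zlib.crc32(tok.encode(\"utf-8\")) % NUM_BUCKETS\n",
    "            for tok in text.lower().split()\n",
    "        )\n",
    "        idx = sorted(buckets)  # unique bucket ids for this text\n",
    "        indices.append(idx)\n",
    "        values.append([float(buckets[i]) for i in idx])\n",
    "    return indices, values\n",
    "\n",
    "\n",
    "# The same function could be passed as both sparse_doc_fn and sparse_query_fn.\n",
    "idx, vals = hashed_tf_vectors([\"llama llama tokens\", \"context length\"])\n",
    "print(idx, vals)"
   ]
  },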
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Customizing `hybrid_fusion_fn()`\n",
    "\n",
    "By default, when running hybrid queries with Qdrant, Relative Score Fusion is used to combine the nodes retrieved from both sparse and dense queries.\n",
    "\n",
    "You can customize this function to be any other method (plain deduplication, Reciprocal Rank Fusion, etc.).\n",
    "\n",
    "Below is the default code for our relative score fusion approach and how you can pass it into the constructor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.vector_stores import VectorStoreQueryResult\n",
    "\n",
    "\n",
    "def relative_score_fusion(\n",
    "    dense_result: VectorStoreQueryResult,\n",
    "    sparse_result: VectorStoreQueryResult,\n",
    "    alpha: float = 0.5,  # passed in from the query engine\n",
    "    top_k: int = 2,  # passed in from the query engine i.e. similarity_top_k\n",
    ") -> VectorStoreQueryResult:\n",
    "    \"\"\"\n",
    "    Fuse dense and sparse results using relative score fusion.\n",
    "    \"\"\"\n",
    "    # sanity check\n",
    "    assert dense_result.nodes is not None\n",
    "    assert dense_result.similarities is not None\n",
    "    assert sparse_result.nodes is not None\n",
    "    assert sparse_result.similarities is not None\n",
    "\n",
    "    # deconstruct results\n",
    "    sparse_result_tuples = list(\n",
    "        zip(sparse_result.similarities, sparse_result.nodes)\n",
    "    )\n",
    "    sparse_result_tuples.sort(key=lambda x: x[0], reverse=True)\n",
    "\n",
    "    dense_result_tuples = list(\n",
    "        zip(dense_result.similarities, dense_result.nodes)\n",
    "    )\n",
    "    dense_result_tuples.sort(key=lambda x: x[0], reverse=True)\n",
    "\n",
    "    # track nodes in both results\n",
    "    all_nodes_dict = {x.node_id: x for x in dense_result.nodes}\n",
    "    for node in sparse_result.nodes:\n",
    "        if node.node_id not in all_nodes_dict:\n",
    "            all_nodes_dict[node.node_id] = node\n",
    "\n",
    "    # normalize sparse similarities from 0 to 1\n",
    "    sparse_similarities = [x[0] for x in sparse_result_tuples]\n",
    "    max_sparse_sim = max(sparse_similarities)\n",
    "    min_sparse_sim = min(sparse_similarities)\n",
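    "    # guard against division by zero when all sparse scores are identical\n",
    "    sparse_range = max_sparse_sim - min_sparse_sim\n",
    "    sparse_similarities = [\n",
    "        (x - min_sparse_sim) / sparse_range if sparse_range > 0 else 1.0\n",
    "        for x in sparse_similarities\n",
    "    ]\n",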
    "    sparse_per_node = {\n",
    "        sparse_result_tuples[i][1].node_id: x\n",
    "        for i, x in enumerate(sparse_similarities)\n",
    "    }\n",
    "\n",
    "    # normalize dense similarities from 0 to 1\n",
    "    dense_similarities = [x[0] for x in dense_result_tuples]\n",
    "    max_dense_sim = max(dense_similarities)\n",
    "    min_dense_sim = min(dense_similarities)\n",
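    "    # guard against division by zero when all dense scores are identical\n",
    "    dense_range = max_dense_sim - min_dense_sim\n",
    "    dense_similarities = [\n",
    "        (x - min_dense_sim) / dense_range if dense_range > 0 else 1.0\n",
    "        for x in dense_similarities\n",
    "    ]\n",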
    "    dense_per_node = {\n",
    "        dense_result_tuples[i][1].node_id: x\n",
    "        for i, x in enumerate(dense_similarities)\n",
    "    }\n",
    "\n",
    "    # fuse the scores\n",
    "    fused_similarities = []\n",
    "    for node_id in all_nodes_dict:\n",
    "        sparse_sim = sparse_per_node.get(node_id, 0)\n",
    "        dense_sim = dense_per_node.get(node_id, 0)\n",
    "        # alpha weights the dense score; (1 - alpha) weights the sparse score\n",
    "        fused_sim = (1 - alpha) * sparse_sim + alpha * dense_sim\n",
    "        fused_similarities.append((fused_sim, all_nodes_dict[node_id]))\n",
    "\n",
    "    fused_similarities.sort(key=lambda x: x[0], reverse=True)\n",
    "    fused_similarities = fused_similarities[:top_k]\n",
    "\n",
    "    # create final response object\n",
    "    return VectorStoreQueryResult(\n",
    "        nodes=[x[1] for x in fused_similarities],\n",
    "        similarities=[x[0] for x in fused_similarities],\n",
    "        ids=[x[1].node_id for x in fused_similarities],\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_store = QdrantVectorStore(\n",
    "    \"llama2_paper\",\n",
    "    client=client,\n",
    "    enable_hybrid=True,\n",
    "    hybrid_fusion_fn=relative_score_fusion,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You may have noticed the `alpha` parameter in the above function. This can be set directly in the `as_query_engine()` call, which will set it in the vector index retriever."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "index.as_query_engine(alpha=0.5, similarity_top_k=2)"
   ]
  },
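  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reciprocal Rank Fusion was mentioned earlier as an alternative fusion method. A minimal sketch of the idea follows. It works on plain ranked lists of node ids rather than `VectorStoreQueryResult` objects so that it runs on its own; the function name `reciprocal_rank_fusion` and the constant `k=60` (a commonly used value) are choices made for this example, not llama-index defaults. It would still need to be wrapped to match the `hybrid_fusion_fn` signature before being passed to the constructor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Dict, List\n",
    "\n",
    "\n",
    "def reciprocal_rank_fusion(\n",
    "    dense_ids: List[str], sparse_ids: List[str], k: int = 60, top_k: int = 2\n",
    ") -> List[str]:\n",
    "    \"\"\"Score each id by the sum of 1 / (k + rank) over the lists containing it.\"\"\"\n",
    "    scores: Dict[str, float] = {}\n",
    "    for ranked in (dense_ids, sparse_ids):\n",
    "        for rank, node_id in enumerate(ranked, start=1):\n",
    "            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)\n",
    "    # sort by score, breaking ties by id so the output is deterministic\n",
    "    ordered = sorted(scores, key=lambda n: (-scores[n], n))\n",
    "    return ordered[:top_k]\n",
    "\n",
    "\n",
    "# Made-up rankings: best match first in each list.\n",
    "fused_ids = reciprocal_rank_fusion([\"a\", \"b\", \"c\"], [\"c\", \"a\", \"d\"])\n",
    "print(fused_ids)"
   ]
  },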
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Customizing Hybrid Qdrant Collections\n",
    "\n",
    "Instead of letting llama-index do it, you can also configure your Qdrant hybrid collections ahead of time.\n",
    "\n",
    "**NOTE:** The names of vector configs must be `text-dense` and `text-sparse` if creating a hybrid index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from qdrant_client import models\n",
    "\n",
    "client.recreate_collection(\n",
    "    collection_name=\"llama2_paper\",\n",
    "    vectors_config={\n",
    "        \"text-dense\": models.VectorParams(\n",
    "            size=1536,  # openai vector size\n",
    "            distance=models.Distance.COSINE,\n",
    "        )\n",
    "    },\n",
    "    sparse_vectors_config={\n",
    "        \"text-sparse\": models.SparseVectorParams(\n",
    "            index=models.SparseIndexParams()\n",
    "        )\n",
    "    },\n",
    ")\n",
    "\n",
    "# enable hybrid since we created a sparse collection\n",
    "vector_store = QdrantVectorStore(\n",
    "    collection_name=\"llama2_paper\", client=client, enable_hybrid=True\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama-index-4a-wkI5X-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
